Abstract. We describe features of the LSST science database that are amenable to scientific data 
mining, object classification, outlier identification, anomaly detection, image quality assurance, and 
survey science validation. The data mining research agenda includes: scalability (at petabyte scales) 
of existing machine learning and data mining algorithms; development of grid-enabled parallel data 
mining algorithms; designing a robust system for brokering classifications from the LSST event 
pipeline (which may produce 10,000 or more event alerts per night); multi-resolution methods 
for exploration of petascale databases; indexing of multi-attribute multi-dimensional astronomical 
databases (beyond spatial indexing) for rapid querying of petabyte databases; and more. 
DATA-INTENSIVE ASTRONOMY AND THE LSST SKY SURVEY 

The development of models to describe and understand scientific phenomena has his- 
torically proceeded at a pace driven by new data. The more we know, the more we are 
driven to tweak or to revolutionize our models, thereby advancing our scientific understanding. 
This data-driven modeling and discovery linkage has entered a new paradigm [1]. The acquisition 
of scientific data in all disciplines is now accelerating and causing a nearly insurmountable 
data avalanche. In astronomy in particular, rapid advances in three technology areas (telescopes, 
detectors, and computation) have continued unabated [3] - all of these advances lead to more 
and more data. With this accelerated 
advance in data generation capabilities, we will require novel, increasingly automated, 
and increasingly more effective scientific knowledge discovery systems [5]. 

Astronomers have been doing data mining for centuries: "the data are mine, and you 
can't have them!" Seriously, astronomers are trained as data miners, because we are 
trained to: (a) characterize the known (i.e., unsupervised learning, clustering); (b) assign 
the new (i.e., supervised learning, classification); and (c) discover the unknown (i.e., 
semi-supervised learning, outlier detection). These skills are more critical than 
ever since astronomy is now a data-intensive science, and it will become even more 
data-intensive in the coming decade. New surveys may produce hundreds of 
terabytes (TB) up to 100 (or more) petabytes (PB) both in the image data archive and 
in the object catalogs (databases). Discovering the ensuing hidden wealth of new scientific 
knowledge will require more sophisticated algorithms and networks that discover, 
integrate, and learn from distributed petascale databases more effectively [11, 12]. 



The problem therefore is this: astronomy researchers will soon (if not already) lose the 
ability to assimilate or to keep up with any of these things: the data flood, the scientific 
discoveries buried within, the development of new models of those phenomena, and the 
resulting new data-driven follow-up observing strategies that are imposed on telescope 
facilities to collect new data needed to validate and augment new discoveries. 

One of the most impressive astronomical sky surveys being planned for the next 
decade is the Large Synoptic Survey Telescope project (LSST at www.lsst.org) [13]. 
The three fundamental distinguishing astronomical attributes of the LSST project are: 

1. Repeated temporal measurements of all observable objects in the sky, corresponding 
to thousands of observations of each object over a 10-year period, expected 
to generate 10,000-100,000 alerts each night to the astronomical research community 
that something has changed at that location on the sky: either the brightness or 
position of an object, or the serendipitous appearance of some totally new object; 

2. Wide-angle imaging that will repeatedly cover most of the night sky within 3 to 4 
nights (= tens of billions of objects); and 

3. Deep co-added images of each observable patch of sky (summed over 10 years: 
2015-2025), reaching far fainter objects and to greater distance over more area of 
sky than other sky surveys [14]. 

Compared to other astronomical sky surveys, the LSST survey will deliver time 
domain coverage for orders of magnitude more objects. It is envisioned that this project 
will produce ~30 TB of data per night of observation for 10 years. The final 
image archive will be greater than 60 PB (and possibly much more), and the final LSST 
astronomical object catalog (object-attribute database) is expected to be ~10-20 PB 
(or more). Additional information about the LSST survey and scientific program is 
given by Ivezic et al. [15] and provided elsewhere in these proceedings [16]. 
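
As a rough consistency check on these volumes, the nightly rate can be multiplied out over the survey lifetime. The sketch below is only back-of-envelope arithmetic: the figure of 300 usable observing nights per year is an assumption made here for illustration and does not come from the text.

```python
# Back-of-envelope check of the survey data volumes quoted in the text.
TB_PER_NIGHT = 30        # from the text: ~30 TB of data per night
SURVEY_YEARS = 10        # from the text: 10-year survey
NIGHTS_PER_YEAR = 300    # ASSUMPTION: usable observing nights per year


def total_volume_pb(tb_per_night, nights_per_year, years):
    """Return the accumulated data volume in petabytes (1 PB = 1000 TB)."""
    return tb_per_night * nights_per_year * years / 1000.0


volume_pb = total_volume_pb(TB_PER_NIGHT, NIGHTS_PER_YEAR, SURVEY_YEARS)
print(f"Accumulated volume over the survey: {volume_pb:.0f} PB")
```

Under that assumption the total comes to 90 PB, which agrees with an image archive "greater than 60 PB (and possibly much more)".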

Since it is anticipated that LSST will generate many thousands (probably tens of thou- 
sands) of new astronomical event alerts per night of observation, there is a critical need 
for innovative follow-up procedures. These procedures necessarily must include mod- 
eling of the events - to determine their classification, time-criticality, astronomical rel- 
evance, rarity, and the scientifically most productive set of follow-up measurements. 
Rapid time-critical follow-up observations, with a wide range of time scales from seconds 
to days, are essential for proper identification, classification, characterization, analysis, 
interpretation, and understanding of nearly every astrophysical phenomenon (e.g., 
supernovae, novae, accreting black holes, microquasars, gamma-ray bursts, gravitational 
microlensing events, extrasolar planetary transits across distant stars, new comets, incoming 
asteroids, trans-Neptunian objects, dwarf planets, optical transients, variable 
stars of all classes, and anything that goes "bump in the night"). 



Petascale Mining of Large Astronomical Sky Surveys 

LSST and similar large sky surveys have enormous potential to enable countless 
astronomical discoveries. Such discoveries will span the full spectrum of statistics: 
from rare one-in-a-billion (or one-in-a-trillion) type objects, to a complete statistical 
and astrophysical specification of a class of objects (based upon millions of instances of 
the class). One of the key scientific requirements of these projects therefore is to learn 
rapidly from what they see. This means: (a) to identify the serendipitous as well as the 
known; (b) to identify rare events that our models say should be there; (c) to identify 
new classes of objects that fall outside the bounds of model expectations; (d) to find new 
attributes of known classes; (e) to provide statistically robust tests of existing models; 
and (f) to generate the vital inputs for new models. All of this requires integrating and 
mining all known data: to train classification models and to apply classification models. 

LSST alone is likely to throw such data mining and knowledge discovery efforts into 
the petascale realm. For example: astronomers currently discover a few hundred new supernovae 
per year. Since the beginning of human history, perhaps ~10,000 supernovae 
have been recorded. Because the identification, classification, and analysis of supernovae 
enable fundamental (Dark Energy) science, it is imperative for astronomers to respond 
quickly to each new event with rapid follow-up observations in many measurement 
modes (light curves; spectroscopy; and images of the host galaxy and its environment). 
Historically, with <10 new supernovae being discovered each week, such follow-up has 
been feasible. But now, LSST promises to produce a list of 1000 new supernovae each 
night for 10 years [14], which represent a small fraction of the total (10-100 thousand) 
alerts expected each night! Astronomers are faced with the enormous challenge of effi- 
ciently mining, correctly classifying, and intelligently prioritizing a staggering number 
of new events for follow-up observation each night for a decade. 
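
The scale of this change can be made explicit with the numbers quoted above. The following lines are illustrative arithmetic only; they assume, as a simplification, that the survey observes on all seven nights of a week.

```python
# Event rates quoted in the text.
SN_PER_NIGHT_LSST = 1000              # expected new supernovae per night
SN_PER_WEEK_HISTORICAL = 10           # text says fewer than 10 per week
ALERTS_PER_NIGHT = (10_000, 100_000)  # range of total alerts per night

# ASSUMPTION: seven observing nights per week (a simplification).
sn_per_week_lsst = SN_PER_NIGHT_LSST * 7

# Lower bound on the increase, since the historical rate is below 10 per week.
rate_increase = sn_per_week_lsst / SN_PER_WEEK_HISTORICAL

# Share of nightly alerts that are supernovae, for each end of the alert range.
sn_fraction = [SN_PER_NIGHT_LSST / n for n in ALERTS_PER_NIGHT]

print(f"Supernova discovery rate increases by at least {rate_increase:.0f}x")
print(f"Supernovae are {sn_fraction[1]:.0%} to {sn_fraction[0]:.0%} of nightly alerts")
```

The supernova discovery rate therefore rises by a factor of at least several hundred, while supernovae remain only about 1-10% of the nightly alert stream.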

The major features and contents of the LSST scientific database include: >100 
database tables; image metadata (675M rows); source catalog (260B rows); object catalog 
(22B rows, with 200+ attributes); moving object catalog; variable object catalog; 
alerts catalog; calibration metadata; configuration metadata; processing metadata; and 
provenance metadata. The science archive will consist of ~2000 images per night (for 10 
years), comprising 60-100 PB of pixel data. This enormous LSST data archive and ob- 
ject database enables a diverse multidisciplinary research program: astronomy & astro- 
physics; machine learning (data mining); exploratory data analysis; XLDB (extremely 
large databases); scientific visualization; computational science & distributed comput- 
ing; and inquiry-based science education (using data in the classroom). 

Many possible scientific data mining use cases are anticipated with the LSST 
database, including: 

• Provide rapid probabilistic classifications for all 10,000 LSST events each night; 

• Find new "fundamental planes" of correlated astrophysical parameters (e.g., the 
fundamental plane of Elliptical galaxies) [19]; 

• Find new correlations, associations, relationships of all kinds from 100+ attributes 
in the LSST science database, integrated with distributed VO-accessible data; 

• Compute multi-point multi-dimensional correlation functions over the full panoply 
of astrophysical parameter spaces; 

• Discover zones of avoidance in interesting parameter spaces (e.g., period gaps); 

• Discover new properties of known classes; 

• Discover new and improved rules for classifying known classes of objects (e.g., 
photometric redshifts) [20]; 



• Discover new and exotic classes of astronomical objects; 

• Identify novel, unexpected temporal behavior in all classes of objects; 

• Hypothesis testing - verify existing (or generate new) astronomical hypotheses 
with strong statistical confidence, using millions of training samples; 

• Serendipity - discover rare one-in-a-billion objects through novelty detection; 

• Image processing - identify non-astronomical features, classify them, and separate 
them from the astronomical catalog inputs [21, 22]; and 

• Quality assurance - identify system glitches, instrument anomalies, and pipeline 
errors through near-real-time deviation detection. 
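
Several of the use cases above (novelty detection, quality assurance) reduce to flagging values that deviate strongly from the bulk of the data. As a minimal sketch, and not the method of any LSST pipeline, the following uses the modified z-score based on the median absolute deviation (MAD). The function name, the sample magnitudes, and the conventional threshold of 3.5 are illustrative assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. It is robust to the outliers it looks for.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # More than half the values are identical; no usable scale estimate.
        return []
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# HYPOTHETICAL repeated magnitude measurements of one object; the sixth
# measurement is a sudden brightening (or an instrument glitch).
magnitudes = [18.2, 18.1, 18.3, 18.2, 18.1, 15.0, 18.2]
print(mad_outliers(magnitudes))  # prints [5]
```

A median-based score is chosen here rather than a mean and standard deviation because a single extreme value would inflate the standard deviation and could mask itself.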

Some of the data mining research challenge areas posed by the arrival of petascale 
scientific databases include: 

• indexing and associative memory techniques (trees, graphs, networks) for multi-attribute 
(high-dimensional) astronomical databases (beyond RA-Dec indexing); 

• scalability of statistical, computational, machine learning, and data mining algo- 
rithms to multi-petabyte scales; 

• algorithms for optimization of simultaneous multi-point fitting across massive 
multi-dimensional data cubes; 

• multi-resolution methods and structures for exploration of petascale databases; 

• petascale analytics for visual exploratory data analysis of massive databases; and 

• rapid query, search, and retrieval algorithms for petabyte databases. 
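
To make the first challenge in this list concrete, a k-d tree is one well-known structure for indexing points on several attributes at once rather than on sky position alone. The sketch below is a small in-memory illustration, not a petascale design. The attribute triples (RA, Dec, magnitude) are hypothetical, and in practice the attributes would need to be rescaled to comparable units before distances are meaningful.

```python
from collections import namedtuple

Node = namedtuple("Node", "point axis left right")


def build_kdtree(points, depth=0):
    """Recursively build a k-d tree, cycling through the attributes by depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build_kdtree(points[:mid], depth + 1),
                build_kdtree(points[mid + 1:], depth + 1))


def sq_dist(a, b):
    """Squared Euclidean distance between two attribute vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, target, best=None):
    """Return the stored point closest to `target`."""
    if node is None:
        return best
    if best is None or sq_dist(node.point, target) < sq_dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Search the far side only if the splitting plane is closer than the
    # best match found so far.
    if diff ** 2 < sq_dist(best, target):
        best = nearest(far, target, best)
    return best


# HYPOTHETICAL catalog rows: (RA, Dec, magnitude).
catalog = [(10.0, 20.0, 18.5), (10.5, 20.1, 21.0),
           (200.0, -5.0, 19.2), (10.1, 19.9, 18.4)]
tree = build_kdtree(catalog)
query = (10.0, 20.0, 18.4)
print(nearest(tree, query))
```

The pruning test is what makes the query faster than a linear scan on average. Its benefit weakens as the number of attributes grows, which is one reason high-dimensional indexing remains a research problem.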

Additional and more in-depth discussion of the petascale data challenges posed by 
the LSST sky survey is available (at www.lsst.org/Project/docs/data-challenge.pdf and 
universe.ucdavis.edu/docs/LSST_petascale_challenge.pdf). 



A Classification Broker for Astronomy 

We envision an astroinformatics (data-intensive astronomy) research paradigm (for 
data integration and mining) to address the petascale needs of large astronomical surveys 
[18, 23]. The impending data loads surpass those of the Sloan Digital Sky Survey by 1000-10,000 
times, while the time-criticality requirement (for event/object classification and 
characterization) drastically drops from months (or weeks) down to minutes (or tens of 
seconds). In addition to the follow-up classification problem (described earlier), we will 
want to find every possible new scientific discovery (pattern, correlation, relationship, 
outlier, new class, etc.) buried within these new enormous databases. This might lead to 
a petascale data mining compute engine that runs in parallel alongside the data archive 
- to test every possible N-point correlation, multi-parameter association, and classifica- 
tion rule. In addition to such a "batch discovery machine", a rapid-response data mining 
engine (i.e., classification broker) is needed in order to produce and distribute scientifically 
robust near-real-time classifications of astronomical sources, events, objects, or 
event host objects (e.g., we need the redshift of the host galaxy in order to interpret 
and classify a supernova accurately) [23, 24, 25]. These classifications are derived from 
integrating and mining data, information, and knowledge from multiple distributed VO-accessible 
data repositories, robotic telescopes, and astronomical alert networks worldwide. 
Incoming event alert data will be subjected to a suite of machine learning (ML) 
algorithms for event classification, outlier detection, object characterization, and novelty 
discovery [18, 23, 24, 25, 26, 27]. Probabilistic ML models will produce rank-ordered 
lists, to guide follow-up observations on the 10-100K alertable astronomical events that 
will be identified each night by the LSST sky survey alone. The classification broker will 
thereby enable rapid follow-up science for the most important and exciting astronomical 
discoveries of the coming decade, on a wide range of time scales from seconds to days, 
corresponding to a plethora of exotic astrophysical phenomena. 
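
One way to picture the rank-ordering step is to score each alert by its expected scientific value under the classifier's output probabilities and keep the top few for follow-up. This is a minimal sketch under stated assumptions: the class names, the value weights in `CLASS_VALUE`, and the alert records are all hypothetical, and a real broker would also weigh time-criticality and telescope availability.

```python
import heapq

# ASSUMPTION: hypothetical scientific-value weight for each event class.
CLASS_VALUE = {
    "supernova": 1.0,
    "microlensing": 0.8,
    "variable_star": 0.2,
    "artifact": 0.0,
}


def priority(alert):
    """Expected value of following up: sum over classes of P(class) * value."""
    return sum(prob * CLASS_VALUE.get(cls, 0.0)
               for cls, prob in alert["class_probs"].items())


def rank_alerts(alerts, top_n):
    """Return the `top_n` alerts in decreasing order of priority."""
    return heapq.nlargest(top_n, alerts, key=priority)


# HYPOTHETICAL alerts with class probabilities from some probabilistic model.
alerts = [
    {"id": "a1", "class_probs": {"supernova": 0.9, "variable_star": 0.1}},
    {"id": "a2", "class_probs": {"variable_star": 0.95, "artifact": 0.05}},
    {"id": "a3", "class_probs": {"microlensing": 0.6, "supernova": 0.3,
                                 "artifact": 0.1}},
]
for alert in rank_alerts(alerts, top_n=2):
    print(alert["id"], round(priority(alert), 2))
```

Using `heapq.nlargest` avoids sorting the full nightly alert list when only a short follow-up list is needed.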